ABSTRACT

In  order  to  realize  the  potential  of  highly  parallel  shared-memory  MIMD 
architectures  for  solving  very  large  problems,  novel  challenges  must  be  faced 
by  the  system  software  designer.  The  operating  system  must  endeavor  to 
utilize all processors fully, without incurring serial bottlenecks during
coordination  operations.  Critical  code  sections  far  too  short  or  infrequent  to 
cause  performance  penalties  on  today's  machines  will  be  of  concern  on  very 
large  machines  because  the  cost  of  each  serial  section  rises  linearly  with  the 
number  of  processors  involved.  Further,  the  control  software  must  provide 
basic  facilities  to  support  a  structured  and  natural  style  of  general-purpose 
parallel  programming.  We  present  the  approaches  taken  for  satisfying  these 
requirements in the NYU Ultracomputer design. We also describe our
preliminary parallel operating system, derived from UNIX, which is
currently running on an eight-processor prototype Ultracomputer.


1.   Introduction 

Continuing  advances  in  microelectronics  will  produce,  during  the  current 
decade,  megabit  memory  chips  as  well  as  high  speed  processor  chip  sets.  With  10- 
20  MIPS  (including  floating  point)  and  a  megabyte  soon  to  be  available  on  a  dozen 
chips,  one  is  led  to  consider  an  extraordinarily  powerful  machine  composed  of 
thousands  of  processors  and  gigabytes  of  memory.  Such  a  configuration  would 
yield several orders of magnitude more performance than do current
supercomputers from roughly the same component count. Moreover, the number
of distinct components would be quite small and thus the design appears attractive.

However,  it  remains  to  be  demonstrated  that  this  potentially  high-performance 
assemblage  can  be  effectively  utilized.  There  are  two  dimensions  to  this  challenge. 
First,  several  thousand  processors  must  be  coordinated  in  such  a  way  that  their 
aggregate  power  is  applied  to  useful  computation.  Serial  procedures  in  which  one 
processor  works  while  the  others  wait  become  bottlenecks  that  drastically  reduce  the 
power  obtained.  Indeed,  for  any  highly  parallel  architecture,  the  relative  cost  of  a 
serial bottleneck rises linearly with the number of processors involved. Second, the
machine  must  be  programmable  by  humans.  Effective  use  of  high  degrees  of 
parallelism  will  be  facilitated  by  simple  languages  and  facilities  for  designing, 
writing,  and  debugging  parallel  programs. 

We  propose  that  the  hardware  and  software  design  of  a  highly  parallel 
computer  should  meet  the  following  goals. 

•  Scaling.  Effective  performance  should  scale  upward  to  a  very  high  level. 
Given  a  problem  of  sufficient  size,  an  n-fold  increase  in  the  number  of 
processors  should  yield  a  speedup  factor  of  almost  n. 

•  General  purpose.  The  machine  should  be  capable  of  efficient  execution  of  a 
wide  class  of  algorithms,  displaying  relative  neutrality  with  respect  to 
algorithmic  structure  or  data  flow  pattern. 

•  Programmability. High-level programmers should not have to consider the
machine's low-level structural details in order to write efficient programs.
Programming and debugging should not be substantially more difficult than on
a serial machine.

•  Multiprogramming.  The  software  should  be  able  to  allocate  processors  and 
other  machine  resources  to  different  phases  of  one  job  and/or  to  different  user 
jobs  in  an  efficient  and  highly  dynamic  way. 

Achievement  of  these  goals  requires  an  integrated  hardware/software  approach. 
The  burden  on  the  operating  system  designer  is  to  support  a  programming  model 
that  is  high-level  and  flexible,  to  schedule  the  processors  so  that  the  workload  is 
balanced,  and,  most  importantly,  to  avoid  critical  sections  that  would  constitute 
unacceptable  serial  bottlenecks. 

We  will  consider  these  and  subsidiary  issues  in  the  context  of  a  particular 
parallel architecture, the NYU Ultracomputer. This design, which will be outlined
in  the  following  section,  was  driven  by  the  above  objectives  plus  the  desire  that  the 
machine  be  buildable  without  requiring  unanticipated  advances  in  hardware 
technology  or  packaging. 

Following   the  architectural   description  we  will  consider  issues  in  parallel 
programming  that  impact  the  operating  system,  in  particular  the  system  services 
utilized  by  parallel  code,  and  questions  of  process  and  job  scheduling.    The  final 
section will describe the data structures and the resource management and
synchronization mechanisms that have been developed for an ultraparallel operating
system. Many of these building blocks have been implemented in an operating
system  currently  in  use  on  an  eight-processor  prototype  Ultracomputer. 

The aim of this paper is to present the system software considerations
that  result  from  supporting  both  highly  parallel  application  programs  and  software 
development  on  the  Ultracomputer  architecture.  Readers  interested  in  the 
hardware  design  of  such  parallel  computers  are  encouraged  to  consult  Edler, 
Gottlieb,  et  al.  [85]  and  Gottlieb,  Grishman  et  al.  [83].  We  have  developed  a 
highly parallel extension of the UNIX operating system, which has been implemented
on our hardware prototype. A description, in UNIX-specific terms, of the issues that
arise can be found in Edler, Gottlieb, and Lipkis [85].


2.   Ultracomputer  Architecture 

In  this  section  we  review  the  architectural  model  on  which  the  Ultracomputer  is 
based, and illustrate the power of this model. Although this idealized machine is
not  physically  realizable,  a  close  approximation  can  be  built.  Elements  of  the  actual 
machine  design  are  described  in  order  to  illustrate  integrated  hardware/software 
mechanisms  for  bottleneck-free  coordination  of  a  very  large  number  of  processors. 

2.1.   The  Model 

An  idealized  parallel  processor,  dubbed  a  "paracomputer"  by  Schwartz  [80] 
and  classified  as  a  WRAM  by  Borodin  and  Hopcroft  [82],  consists  of  a  number  of 
autonomous  processing  elements  (PEs)  sharing  a  central  memory^.  The  model 
permits  every  PE  to  read  or  write  a  shared  memory  cell  in  one  cycle.  In  particular, 
simultaneous  reads  and  writes  directed  at  the  same  memory  cell  are  effected  in  a 
single  cycle. 

In  order  to  make  precise  the  effect  of  simultaneous  access  to  shared  memory 
we  define  the  serialization  principle,  which  states  that  the  effect  of  simultaneous 
actions  by  the  PEs  is  as  if  the  actions  had  occurred  in  some  (unspecified)  serial 
order.  Thus,  for  example,  a  load  simultaneous  with  two  stores  directed  at  the  same 
memory  cell  will  return  either  the  original  value  or  one  of  the  two  stored  values, 
possibly  different  from  the  value  which  the  cell  finally  comes  to  contain.  Note  that 
simultaneous  memory  updates  are  in  fact  accomplished  in  one  cycle;  the  serialization 
principle speaks only of the effect of simultaneous actions and not of their
implementation. 
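
As a small illustration of the serialization principle, the following Python sketch enumerates every serial order of one load and two stores directed at the same cell and collects the possible outcomes. The function name and the tuple encoding of operations are our own illustrative choices, not part of the paracomputer definition.

```python
from itertools import permutations

def outcomes(initial, stores):
    """Return the set of (value seen by the load, final cell value) pairs
    over every serial order of one load and the given stores."""
    ops = [("load", None)] + [("store", v) for v in stores]
    results = set()
    for order in permutations(ops):
        cell, loaded = initial, None
        for kind, value in order:
            if kind == "load":
                loaded = cell   # the load sees whatever the cell holds now
            else:
                cell = value    # a store overwrites the cell
        results.add((loaded, cell))
    return results
```

With an initial value of 0 and stores of 1 and 2, the load may return 0, 1, or 2, and in some orders (for example store 1, load, store 2) the value loaded differs from the value the cell finally contains, as the text notes.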


^To be more specific, Fortune and Wyllie [78] introduced the PRAM (Parallel Random Access
Machine),  a  multiple  processor  analogue  of  Aho,  Hopcroft,  and  Ullman's  [74]  RAM,  in  which 
concurrent  reads  of  a  single  location  are  permitted  but  concurrent  writes  are  not.  Snir  [82]  classifies 
this  machine  as  a  CREW  PRAM  (Concurrent  Read  Exclusive  Write).  Both  the  WRAM  and  the 
paracomputer  permit  concurrent  writes  and  are  thus  classified  as  CRCW  PRAMs  by  Snir.  Our 
model  is  essentially  a  CRCW  PRAM  augmented  by  the  fetch-and-add  coordination  primitive  defined 
below.

Several points need to be noted concerning the difficulty of a hardware
implementation  of  this  model.  First,  the  single  cycle  access  to  globally  shared 
memory  is  not  possible  to  achieve.  Indeed,  for  any  technology  there  is  a  limit,  say 
b,  on  the  number  of  signals  that  one  can  fan  in  at  once.  Thus,  if  N  processors  are 
to  access  even  a  single  bit  of  shared  memory,  the  shortest  access  time  possible  is 
log_b N. As will be seen, hardware closely approximating this behavior has been
designed.  But  it  does  not  come  for  free.  The  processor  to  memory  interconnection 
network  used  cannot  be  effectively  constructed  using  off  the  shelf  components;  a 
custom  VLSI  design  is  needed.  In  addition  to  increasing  the  design  time  for  such  a 
machine,  the  network  adds  to  its  replication  cost  and  size.  Thus,  for  a  fixed 
number  of  dollars  (or  cubic  feet,  or  BTUs,  etc.),  a  shared  memory  design  will 
contain fewer processors or memory cells than will a strictly private memory design
constructed  from  the  same  technology.  Although  we  believe  that  the  lower  peak 
performance  inherent  in  shared  memory  designs  is  adequately  compensated  for  by 
their  increased  flexibility  and  generality,  this  issue  is  not  settled.  Most  likely  the 
answer  will  prove  to  be  so  application  dependent  that  both  shared  and  private 
memory  designs  will  prove  successful.  A  similar  tradeoff  between  peak 
performance and flexibility arises in the choice of autonomous processors, i.e. an
MIMD  design*.  The  alternative  SIMD  architecture  class,  in  which  for  any  given 
cycle,  all  active  processors  execute  the  same  instruction,  has  already  proven  itself  to 
be a cost-effective method of attaining very high performance. Indeed, all current
vector  supercomputers  fall  into  this  class.  It  is  also  well  known,  however,  that 
converting  algorithms  to  vector  form  is  often  difficult  and  the  resulting  programs 
tend  to  be  brittle,  i.e.  hard  to  transport  to  different  hardware  or  to  modify  to  use 
improved  algorithms. 

2.2.   The  Fetch-and-add  Operation 

We  augment  the  paracomputer  model  with  the  "fetch-and-add"  (F&A) 
operation,  a  powerful  interprocessor  synchronization  operation  that  permits  highly 
concurrent  execution  of  operating  system  primitives  and  application  programs  (see 
Gottlieb and Kruskal [81]). Fetch-and-add is essentially an indivisible add to
memory; its format is F&A(V,e), where V is an integer variable and e is an integer
expression. The operation is defined to return the (old) value of V and to replace
V by the sum V+e. Moreover, concurrent fetch-and-adds are required to satisfy
the  serialization  principle  enunciated  above.  Thus  fetch-and-add  operations 
simultaneously  directed  at  V  would  cause  V  to  be  modified  by  the  appropriate  total 
increment  while  each  operation  yields  the  intermediate  value  of  V  corresponding  to 
its position in this order. The following example illustrates the semantics of fetch-
and-add: Consider several PEs concurrently executing F&A(I,1), where I is a
shared variable used to index into a shared array. Each PE obtains an index to a
distinct array element (although one cannot predict which element will be assigned
to which PE), and I receives the appropriate total increment.


*In the taxonomy of Flynn [66], MIMD is the category of Multiple Instruction stream, Multiple
Data stream computers; whereas in SIMD designs there is just a Single Instruction stream for the
Multiple Data streams.
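
The F&A(I,1) example can be mimicked in software. The sketch below is only an emulation under stated assumptions: Python threads stand in for PEs, and a lock stands in for the indivisibility that the model attributes to the hardware. The names SharedInt and claim_slots are hypothetical.

```python
import threading

class SharedInt:
    """A shared integer with an indivisible fetch-and-add.

    The lock merely emulates the atomicity assumed of the hardware;
    it is not part of the paracomputer model."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        with self._lock:
            old = self._value
            self._value = old + e
            return old          # the (old) value of the variable

    @property
    def value(self):
        with self._lock:
            return self._value

def claim_slots(num_pes):
    """Each simulated PE executes F&A(I,1) once and records the index it got."""
    index = SharedInt(0)
    claimed = [None] * num_pes

    def pe(pe_id):
        claimed[pe_id] = index.fetch_and_add(1)

    threads = [threading.Thread(target=pe, args=(i,)) for i in range(num_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed, index.value
```

Which PE receives which index varies from run to run, but the indices are always distinct and the shared variable always ends up incremented by the number of PEs.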

2.3.  The General Fetch-and-φ Operation

Fetch-and-add is a special case of the more general fetch-and-φ operation
(where φ may be an arbitrary binary associative operator) introduced by Gottlieb
and Kruskal. The classic test-and-set and compare-and-swap synchronization
operations are both special cases of fetch-and-φ as well, and the familiar load and
store operations are degenerate cases. For example, Test&Set(S) is equivalent to
Fetch&OR(S,TRUE), while Load(S) may be accomplished with Fetch&π1(S,*)
where π1(x,y) = x and the value of * is immaterial (and thus need not be
transmitted).

Although  experience  to  date  shows  fetch-and-add  to  be  the  most  important 
member of this class, other fetch-and-φ operations will be useful as well in the
construction of ultraparallel operating systems. For example, fetch-and-or with
fetch-and-and provide a means for atomically setting and clearing bit flags within a
word of storage. On a standard architecture without these operations, bit setting
and clearing functions require critical sections protected by semaphores.
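
A sketch of the bit-flag usage follows, again with a lock emulating the indivisible hardware operation. The class FlagWord and the helpers set_flag and clear_flag are hypothetical names introduced for illustration.

```python
import threading

class FlagWord:
    """A word of bit flags updated only through fetch-and-or / fetch-and-and."""
    def __init__(self, bits=0):
        self._bits = bits
        self._lock = threading.Lock()   # emulates hardware atomicity

    def fetch_and_or(self, mask):
        with self._lock:
            old = self._bits
            self._bits = old | mask
            return old

    def fetch_and_and(self, mask):
        with self._lock:
            old = self._bits
            self._bits = old & mask
            return old

def set_flag(word, bit):
    """Set one flag; report whether it was already set."""
    return bool(word.fetch_and_or(1 << bit) & (1 << bit))

def clear_flag(word, bit):
    """Clear one flag; report whether it was set beforehand."""
    return bool(word.fetch_and_and(~(1 << bit)) & (1 << bit))
```

Because each helper is a single fetch-and-φ, the caller learns the previous state of the flag and changes it in one indivisible step, with no surrounding critical section.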

2.4.  The Power of Fetch-and-add

Using  the  fetch-and-add  operation  we  can  perform  many  important  algorithms 
in  a  completely  parallel  manner,  i.e.  without  using  any  critical  sections.  For 
example,  as  indicated  above,  concurrent  executions  of  F&A(I,1)  yield  consecutive 
values  that  may  be  used  to  index  an  array.  If  this  array  is  interpreted  as  a 
(sequentially  stored)  queue,  the  values  returned  may  be  used  to  perform  concurrent 
inserts;  analogously  F&A(D,1)  may  be  used  for  concurrent  deletes.  The  complete 
queue  algorithms  contain  checks  for  overflow  and  underflow,  collisions  between 
insert  and  delete  pointers,  etc.  (see  Gottlieb,  Lubachevsky,  and  Rudolph  [83]). 
Forthcoming  sections  will  indicate  how  such  techniques  can  be  used  to  implement  a 
totally  decentralized  operating  system  scheduler.  We  are  unaware  of  any  other 
completely  parallel  solutions  to  this  problem.  To  illustrate  the  nonserial  behavior 
obtained,  we  note  that  given  a  single  queue  that  is  neither  empty  nor  full,  the 
concurrent  execution  of  thousands  of  inserts  and  thousands  of  deletes  can  all  be 
accomplished  in  the  time  required  for  just  one  such  operation.  The  importance  of 
critical  section  free  queue  management  routines  may  be  seen  in  the  following 
remark  of  Deo  et  al.  [80]. 

However,  regardless  of  the  number  of  processors  used,  we  expect  that  algorithm 
PPDM  has  a  constant  upper  bound  on  its  speedup,  because  every  processor 
demands  private  use  of  the  Q. 
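
The core idea of the queue operations can be sketched as follows. This is deliberately not the complete algorithm of Gottlieb, Lubachevsky, and Rudolph [83]: it omits the overflow, underflow, and pointer-collision checks mentioned above, so it is valid only while the queue never fills and deletes never overtake inserts. All class and function names are illustrative, and the lock merely emulates an indivisible fetch-and-add.

```python
import threading

class Counter:
    """Shared integer with fetch-and-add (lock emulates hardware atomicity)."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        with self._lock:
            old = self._value
            self._value = old + e
            return old

class SketchQueue:
    """Sequentially stored queue whose slots are claimed with fetch-and-add.
    No overflow, underflow, or collision checks are performed."""
    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.insert_ptr = Counter()   # plays the role of I
        self.delete_ptr = Counter()   # plays the role of D

    def insert(self, item):
        i = self.insert_ptr.fetch_and_add(1) % self.size   # F&A(I,1)
        self.slots[i] = item

    def delete(self):
        d = self.delete_ptr.fetch_and_add(1) % self.size   # F&A(D,1)
        item, self.slots[d] = self.slots[d], None
        return item

def run_concurrently(calls):
    """Run each zero-argument callable in its own thread; collect the results."""
    results = [None] * len(calls)

    def runner(k, fn):
        results[k] = fn()

    threads = [threading.Thread(target=runner, args=(k, fn))
               for k, fn in enumerate(calls)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each inserter or deleter touches shared state only through one fetch-and-add and one access to the slot it alone has claimed, which is why no operation needs to wait for another.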

For  another  example  of  the  power  of  fetch-and-add,  consider  the  classical 
readers-writers  problem  in  which  two  classes  of  processes  are  to  share  access  to  a 
central  data  structure.  One  class,  the  readers,  may  execute  concurrently;  whereas 
the writers demand exclusive access. Although there are many solutions to this
problem, only the fetch-and-add based solution given by Gottlieb, Lubachevsky,
and  Rudolph  has  the  crucial  property  that  during  periods  of  no  writer  activity,  no 
critical  sections  are  executed^.  Other  highly  parallel  fetch-and-add-based  algorithms 
appear  in  Kalos  [81],  Kruskal  [81a],  and  Rudolph  [82]. 
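
To convey the flavor of such a solution, the sketch below shows a simplified busy-wait scheme built on a single shared counter. It is written in the spirit of the fetch-and-add approach but is not the published algorithm; it assumes that the number of simultaneously attempting readers stays below the bound max_readers, and it makes no attempt at fairness. All names are hypothetical, and the lock only emulates the indivisible fetch-and-add.

```python
import threading

class Counter:
    """Shared integer with fetch-and-add (lock emulates hardware atomicity)."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, e):
        with self._lock:
            old = self._value
            self._value = old + e
            return old

class ReadersWriters:
    """Each reader adds 1 to the counter; a writer adds max_readers."""
    def __init__(self, max_readers=1024):
        self.n = max_readers        # assumed bound on concurrent readers
        self.c = Counter()

    def try_read_enter(self):
        if self.c.fetch_and_add(1) < self.n:
            return True             # no writer holds the structure
        self.c.fetch_and_add(-1)    # undo; the caller may retry
        return False

    def read_exit(self):
        self.c.fetch_and_add(-1)

    def try_write_enter(self):
        if self.c.fetch_and_add(self.n) == 0:
            return True             # no readers and no other writer
        self.c.fetch_and_add(-self.n)
        return False

    def write_exit(self):
        self.c.fetch_and_add(-self.n)
```

While no writer is active, a reader performs exactly two fetch-and-adds (one on entry, one on exit) and never waits for another reader.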

2.5.   Hardware  Realization 

The  paracomputer  is  not  physically  realizable,  due  to  fan-in  (and  other) 
limitations. Furthermore, memory modules operate sequentially; only one load or
store  may  be  satisfied  in  one  cycle.  If  concurrent  fetch-and-add  or  load  operations 
were  to  be  serialized  at  the  memory  of  a  real  parallel  computer,  then  we  would  lose 
the  advantage  of  parallel  coordination  algorithms,  having  merely  moved  the  critical 
sections  from  the  software  to  the  hardware. 

In  fact,  a  parallel  processor  closely  approximating  our  idealized  paracomputer 
can be built. In this section we sketch the design of the NYU Ultracomputer, and
refer the reader to Edler, Gottlieb, et al. [85] and Gottlieb, Grishman, et al. [83]
for a more detailed description. The Ultracomputer uses a message switching
network with the topology of Lawrie's [75] Ω-network (equivalently, the SW
Banyan of Goke and Lipovski [73]) to connect N autonomous PEs (N a power of two) to a
central  shared  memory  composed  of  N  memory  modules  (MMs).  Note  that  the 
direct  single  cycle  access  to  shared  memory  characteristic  of  paracomputers  is 
approximated  by  an  indirect  access  via  a  multicycle  connection  network. 

2.5.1.  Ω-network Enhancements. The manner in which an Ω-network can be
used to implement memory loads and stores is well known and is based on the
existence of a (unique) path connecting each PE-MM pair. We first enhance the
basic Ω-network design by associating queues with each switch to enable concurrent
processing of requests for the same port*.

When  concurrent  loads  and  stores  are  directed  at  the  same  memory  location 
and  meet  at  a  switch,  they  can  be  combined  without  introducing  any  delay  (see 
Klappholz  [80],  and  Gottlieb,  Lubachevsky,  and  Rudolph  [83]).  Combining 
requests reduces communication traffic and thus decreases the lengths of the queues
within  the  switches,  leading  to  lower  network  latency  (i.e.  reduced  memory  access 
time).  Since  combined  requests  can  themselves  be  combined,  the  network  satisfies 
the  key  property  that  any  number  of  concurrent  memory  references  to  the  same 
location  can  be  satisfied  in  the  time  required  for  one  central  memory  access.  It  is 
this  property,  when  extended  to  include  fetch-and-add  operations,  that  permits  the 
bottleneck-free  implementation  of  many  coordination  protocols. 



^Most  other  solutions  require  readers  to  execute  small  critical  sections  to  check  if  a  writer  is 
active  and  indivisibly  announce  their  own  presence.  The  "eventcount"  mechanism  of  Reed  and 
Kanodia  [79],  although  completely  parallel  in  the  above  sense,  detects  rather  than  prevents  the 
simultaneous  activity  of  readers  and  writers. 

*The alternative adopted by Burroughs [79] of killing one of the two conflicting requests limits
bandwidth to O(N/log N); see Kruskal and Snir [83].

2.5.2.  Implementing Fetch-and-add. By including adders in the MMs, the
fetch-and-add  operation  can  be  easily  implemented:  When  F&A(X,e)  reaches  the 
MM  containing  X,  the  value  of  X  and  the  transmitted  e  are  brought  to  the  MM 
adder,  the  sum  is  stored  in  X,  and  the  old  value  of  X  is  returned  through  the 
network  to  the  requesting  PE.  Since  fetch-and-add  is  our  sole  synchronization 
primitive  and  is  also  a  key  ingredient  in  many  algorithms,  concurrent  fetch-and-add 
operations  will  often  be  directed  at  the  same  location.  Thus,  as  indicated  above,  it 
is  crucial  that  a  design  supporting  large  numbers  of  processors  not  serialize  this 
activity. 

Enhanced  switches  permit  the  network  to  combine  fetch-and-adds  with  the 
same  efficiency  as  it  combines  loads  and  stores.  When  two  fetch-and-adds 
referencing the same shared variable, say F&A(X,e) and F&A(X,f), meet at a
switch, the switch forms the sum e+f, transmits the combined request
F&A(X,e+f), and stores the value e in its local memory. When the value Y is
returned to the switch in response to F&A(X,e+f), the switch transmits Y to satisfy
the original request F&A(X,e) and transmits Y+e to satisfy the original request
F&A(X,f). Assuming that the combined request was not further combined with yet
another request, we would have Y = X; thus the values returned by the switch are
X and X+e, thereby effecting the serialization order "F&A(X,e) followed
immediately by F&A(X,f)". The memory location X is also properly incremented,
becoming X+e+f. If other fetch-and-add operations updating X are encountered,
the  combined  requests  are  themselves  combined,  and  the  associativity  of  addition 
guarantees  that  the  procedure  gives  a  result  consistent  with  the  serialization 
principle. 
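
The combining procedure can be modeled functionally. The sketch below is a software abstraction under stated assumptions, not a description of the switch hardware: it supposes that all requests in a batch are directed at the same location and meet pairwise in a tree of switches, and it ignores queues and timing entirely. The names MemoryModule, combined_fetch_and_add, and _decombine are ours.

```python
class MemoryModule:
    """Memory module with an adder; it serves one request at a time."""
    def __init__(self):
        self.cells = {}
        self.requests_served = 0

    def fetch_and_add(self, addr, e):
        self.requests_served += 1
        old = self.cells.get(addr, 0)
        self.cells[addr] = old + e
        return old

def combined_fetch_and_add(mm, addr, increments):
    """Serve simultaneous F&A(addr, e_k) requests through combining switches.

    Forward path: requests are merged by summing increments, so the
    memory module receives a single F&A carrying the total.
    Return path: see _decombine."""
    if not increments:
        return []
    y = mm.fetch_and_add(addr, sum(increments))
    return _decombine(y, increments)

def _decombine(y, increments):
    """A switch returns Y to its first request and Y plus the increment it
    saved locally to its second; subtrees repeat the same rule."""
    if len(increments) == 1:
        return [y]
    mid = len(increments) // 2
    first, second = increments[:mid], increments[mid:]
    saved = sum(first)   # the value the switch stored on the forward path
    return _decombine(y, first) + _decombine(y + saved, second)
```

For two requests with X = 10, e = 3, and f = 4, the model returns 10 and 13 and leaves X equal to 17, while the memory module serves only one request, matching the description above.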

Figure  1  is  a  photograph  of  a  VLSI  component  that  performs  the  forward  (i.e. 
processor to memory) direction of the above procedure. The resulting chip, a four
micron  nMOS  design  fabricated  by  the  DARPA  MOSIS  facility,  has  been 
successfully  tested.  Due  to  pin  and  heat  limitations  only  a  very  narrow  data  path 
was  implemented.  Larger  pin  packages  are  available  but,  to  reduce  the  heat 
consumption,  we  have  subsequently  switched  our  efforts  to  CMOS.  See  Dickey  et 
al.   [85]  for  details  of  this  VLSI  work. 

Analogous hardware enables the machine to support fetch-and-φ effectively for
other associative binary operators φ.

2.5.3.  Processor cache. The impact of network latency on performance is
reduced by associating a local cache memory with each PE. Frequently-used
program code and data can be accessed in (approximately) a single processor cycle
when resident in the cache. Thus the network latency is eliminated from many
memory accesses, and all accesses benefit from the reduced network traffic.

Unfortunately, caching of read-write shared variables presents a coherence
problem among the various caches. Different caches will in general come to contain
different values for the same variable, and updates of the corresponding memory
cell will occur out of sequence. Means are needed to ensure synchronized, coherent
access among the multiple PEs for variables used for coordination or data
transmission. An obvious mechanism, which is employed in the prototype
Ultracomputer, is merely to prevent caching of read-write shared variables.^ The
software  is  required  to  distinguish  between  shared  and  private  variables,  typically 
by  grouping  them  into  separate  memory-management  segments,  and  communicate 
"cacheability"  information  to  the  hardware  on  each  memory  access. 

Because  the  cache  hit  ratio  is  a  major  factor  in  machine  performance, 
maximizing cacheability is an important function of the compiler and operating
system  software.  This  suggests  supporting  a  more  elaborate  scheme  in  which 
shared  variables  that  are  accessed  read-only,  or  accessed  only  privately  during  a 
particular  code  segment,  may  be  cacheable  during  execution  of  that  segment  (see 
McAuliffe  [85]). 


3.   Parallel  Programming 

3.1.   Levels  of  Parallel  Control 

We  consider  applications  programming  of  Ultracomputer-like  parallel 
computers  primarily  from  the  standpoint  of  operating  system  design.  The  shared- 
memory MIMD model can support a number of different styles of programming.
Relative  efficiency  and  usefulness  of  these  alternatives  are  affected  by  performance 
issues  relating  to  the  underlying  software  implementation.  Automatic 
parallelization of sequential programs is not addressed; we assume here that
programs  are  designed  originally  for  a  shared-memory  MIMD  computer.* 

Much  work  on  concurrent  programming  has  focused  on  a  high-level,  block- 
structured  paradigm  of  parallel  process  control.  Under  this  model  parallel  code  is 
structured  in  closed-form  constructs  with  implicit  synchronization  at  the  end  of  each 
parallel block; explicit synchronization and "fork/join" operations (Conway [63],
Dennis and Van Horn [66]) are discouraged or disallowed. The argument is that
the  resulting  parallel  code  is  simpler,  clearer,  and  easier  to  debug  than  code  with 
unrestrained  and  unstructured  process  creation/destruction  and  synchronization. 
Furthermore,  this  programming  style  obviates  in  many  cases  the  need  for  explicit 
shared/private  declarations.  That  distinction  is  instead  implicit  in  the  usual  static 
scope rules. Thus, for example, a variable that is visible within a block defining an
"iteration" of a parallel loop, but declared in a larger enclosing scope, is taken to be
shared during execution of the parallel loop; a variable declared within the block
itself has the scope of an individual iteration and hence is private. Finally, automatic
optimization of parallel code is facilitated when the parallel structure of a program
can be readily detected by the compiler. It should be possible, for example, for an
optimizing compiler to implement a fine granularity of control over the cacheability
of data areas, with greater reliability (and thus avoidance of cache coherence errors)
than could be specified by the programmer.


^Because of the stochastic nature of memory access through the Ω-network, this may not be
sufficient to ensure that the synchronization is maintained. If the processor or cache is capable of
issuing memory requests and proceeding without waiting for acknowledgment from the network,
then for code sensitive to synchronization a further mechanism (a "FENCE" operation) is needed to
guarantee that updates are sequenced correctly (Collier [81]).

*However, the automatic techniques of Kuck (see Kuck and Padua [79]) and Kennedy (see
Kennedy [80], Allen and Kennedy [84]) will be important for running both existing and new
applications.

However,  the  utility  and  effectiveness  of  this  high-level  parallel  programming 
style is not yet demonstrated. Furthermore, the volatile parallelism facilitated by
these closed-form parallel constructs will not be needed in all programs. When
parallel processes need to synchronize very frequently, the overhead involved in the
process creation and termination operations invoked by these constructs will become
significant. This overhead can be alleviated to some extent by pre-spawning
processes (see section 3.5). However, many parallel applications will be more
suited to a lower-level programming style. One such approach involves the initial
creation of a fixed number of long-lived processes, usually smaller than the number
of  PEs.  Synchronization,  scheduling,  and  memory  management  become  entirely  the 
responsibility  of  the  application  programmer.  Here  the  syntactic  structure  of  the 
program  provides  no  information  regarding  the  dynamic  parallelism  or  the  sharing 
of  data.   Ramifications  of  this  programming  model  will  be  discussed  in  section  4.2. 

3.2.   Parallel Constructs

While programming language issues per se are not germane to this paper, for
concreteness  we  present  examples  of  two  commonly  proposed  high-level  parallel 
constructs.  Note  that  programming  in  the  MIMD  parallel  environment  need  not  be 
radically  different  from  conventional  sequential  programming.  We  consider  parallel 
languages  which  are  variants  of  conventional  procedural  languages,  augmented  only 
with  a  shared/private  attribute  for  declared  variables  and  a  small  number  of  explicit 
parallel  control  constructs. 

3.2.1.  Control Structures. The parallel loop, in which homogeneous iterates
are executed in parallel instead of serially, is an obvious parallel extension of the
loop construct found in every procedural language (see Gosden [66], Droughon et
al. [67], Lundstrom and Barnes [80], and Davies [81]). It may be stated as:

    forall j from lb to ub step inc do s

As is the case for a sequential loop, the number of iterations, i, performed over the
loop body, s, is controlled by the variables lb, ub, and inc. However, the i
iterations  of  the  body  are  executed  in  parallel  with  each  evaluation  using  a  different 
value  for  j.  Each  iteration  is  represented  by  a  different  process;  the  operating 
system  arranges  that  each  process  is  scheduled  on  a  PE  for  execution.  The  loop 
construct  provides  implicit  synchronization;  that  is,  the  statements  following  the 
forall  block  will  be  executed  (serially)  only  after  all  of  the  parallel  iterations  have 
completed  executing  the  loop  body  s. 
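
The semantics of the parallel loop can be approximated in Python as follows. This is only a sketch under several assumptions of ours: the bound ub is taken to be inclusive, inc is taken to be positive, and a pool of worker threads stands in for the PEs on which the operating system would schedule the iterations. The function name forall mirrors the construct; its signature is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def forall(lb, ub, inc, body, max_workers=None):
    """Run body(j) for j = lb, lb+inc, ... (not exceeding ub) as parallel tasks.

    Returning from this function plays the role of the implicit
    synchronization: every iteration has completed by then."""
    indices = range(lb, ub + 1, inc)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all iterations and re-raises any exception
        list(pool.map(body, indices))

def squares_up_to(n):
    """Usage example: each iteration writes its own element of a shared list."""
    result = [0] * (n + 1)

    def body(j):
        result[j] = j * j   # j is private to the iteration; result is shared

    forall(0, n, 1, body)
    return result           # executed serially, after all iterations finish
```

Note that the number of workers may be smaller than the number of iterations; the iterations are then multiplexed onto the available workers, much as tasks would be scheduled dynamically onto PEs.
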

The parallel compound statement, or parallel block, contains inhomogeneous
statements that are to be executed in parallel. It is also a popular structure and has
appeared many times (e.g., Dijkstra [68], Brinch-Hansen [73], and the collateral
expression of ALGOL 68) and may be expressed as follows:

    s_0
    parbegin
        s_1; s_2; ...; s_n
    parend
    s_{n+1}

First, the statement s_0 is executed; then statements s_1 through s_n are executed in
parallel; finally statement s_{n+1} is executed.

Neither  construct  contains  explicit  reference  to  processors.  The  PEs  themselves 
are  a  resource  that  is  only  indirectly  available  to  the  program.  Parallel  constructs 
generate  tasks  (also  called  processes)  which  run  on  processors.  Since  the  degree  of 
parallelism  (number  of  concurrently  active  tasks)  may  be  highly  volatile,  static 
assignment  of  processors  to  programs  is  not  in  general  desirable  (exceptions  will  be 
discussed  shortly).  Rather,  tasks  are  scheduled  on  processors  dynamically. 
Potentially,  all  of  the  tasks  generated  by  a  parallel  construct  could  execute 
simultaneously. Whether this actually occurs is a function of scheduling policy,
system load, and various characteristics of the user program and language
implementation details.

3.2.2.  Barrier Synchronization. While it is important to minimize the idle
processor time caused by synchronization among parallel tasks, such
synchronization is occasionally necessary. Barrier synchronization causes each of
the  tasks  executing  a  common  routine  to  wait  at  the  "barrier",  or  synchronization 
point,  until  all  of  the  tasks  of  a  given  set  (e.g.,  those  processes  spawned  on  behalf 
of  a  particular  parallel  loop)  have  reached  that  point.  Algorithms  for  barrier 
synchronization  have  been  given  by  Rudolph  [82]  and  Kruskal  [81b]. 

The  barrier  is  specified  with  the  sync  statement.  This  form  of  synchronization 
is  employed,  for  example,  when  serial  loops  executed  by  parallel  processes  must 
execute  in  lock-step. 
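
A minimal sketch of lock-step execution follows. It assumes Python's `threading.Barrier` as a stand-in for the sync statement; the function name `lockstep` and the neighbour-sum computation are hypothetical and chosen only to show why the barrier is needed.

```python
import threading

def lockstep(values, steps):
    """Each of n tasks runs the same serial loop; sync keeps the iterations aligned."""
    n = len(values)
    cur = list(values)
    nxt = [0] * n
    sync = threading.Barrier(n)           # stand-in for the sync statement

    def task(i):
        for _ in range(steps):
            nxt[i] = cur[i] + cur[(i - 1) % n]   # reads a neighbour's old value
            sync.wait()                   # every task has finished reading
            cur[i] = nxt[i]
            sync.wait()                   # every task has finished writing

    threads = [threading.Thread(target=task, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cur
```

Without the two barrier points, a fast task could overwrite `cur[i]` before its neighbour had read the previous value.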

3.3.   Implementation  of  Parallel  Constructs 

The  benefits  of  a  "high-level"  programming  style  were  discussed  above.  We 
now  must  consider  whether  that  style  can  be  supported  without  loss  of  efficiency  or 
flexibility.  In  particular,  we  must  ensure  that  opportunities  for  parallelism  in 
programs  are  not  sacrificed  due  to  the  parallel  language  implementation.  We  shall 
see  that  the  granularity  of  parallelizable  tasks  is  a  crucial  barometer  of  the 
effectiveness  of  the  implementation. 

The  basic  mechanism  provided  by  the  operating  system  for  creating  parallelism 
is the spawn operation. As such, it is the natural facility with which to support the
parallel loop and parallel block constructs. Spawn is fundamentally an n-way fork
of  control,  in  which  n  identical  subtasks  are  created.  The  subtasks  are  made 
available  for  scheduling  on  any  available  PEs,  in  a  manner  to  be  discussed 
subsequently.  We  assume  here  that  the  "parent"  process  waits  for  the  termination 
of  its  spawned  "children",  which  occurs  automatically  at  the  end  of  the  parallel 
code block. Hence the parent process, and thus the following program statements,
are synchronized with the completion of the spawned parallel processes.

The  translation  of  a  parallel  block  into  runtime  code,  then,  might  make  use  of 
the  following  operating  system  primitives.  We  suppose  that  a  parallel  loop,  with  n 
homogeneous  iterates,  has  been  coded. 

spawn(n)  -  is  executed  by  the  original  (parent)  process.  It  stores  the  value  n  in  a 
shared  children  variable,  for  use  in  synchronizing  at  the  end  of  the  block,  then  adds 
n  items  to  the  system  work  queue. 

wait - is executed by the parent following the spawn. The parent process blocks
until resumed subsequently by an active process.

terminate - is executed by the spawned child tasks at the end of the loop body.
terminate records the process termination with a F&A(children, -1) operation; if
this drives children to zero then the current process is the last terminating child and
thus resumes the parent process. Otherwise, terminate calls gettask to obtain new
work.

gettask - deletes a process from the central work pool, and initiates its execution.

An obvious implementation for the central work pool is the fetch-and-add
based  parallel-access  queue  which  was  mentioned  earlier. 
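
The four primitives can be sketched as follows. This is a simplified model under stated assumptions, not the system's implementation: `FetchAndAdd` emulates the hardware primitive with a lock, `queue.Queue` stands in for the parallel-access work pool, threads stand in for PEs, and `run_parallel_loop` is a hypothetical driver. For brevity every PE, including the one that resumes the parent, simply loops back to gettask.

```python
import queue
import threading

class FetchAndAdd:
    """Emulated fetch-and-add: atomically adds delta and returns the OLD value."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()     # emulation only; the hardware needs no lock

    def add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old

def run_parallel_loop(body, n, num_pes=4):
    if n == 0:
        return
    work = queue.Queue()                  # stand-in for the central work pool
    children = FetchAndAdd(0)
    parent_resumed = threading.Event()

    def spawn(count):
        children.add(count)               # record how many children must terminate
        for i in range(count):
            work.put(i)                   # add the items to the work queue

    def terminate():
        if children.add(-1) == 1:         # old value 1 means children is now zero
            parent_resumed.set()          # last child resumes the parent

    def pe():
        while True:
            i = work.get()                # gettask
            if i is None:                 # sentinel used only to stop this demo
                return
            body(i)
            terminate()

    pes = [threading.Thread(target=pe) for _ in range(num_pes)]
    for t in pes:
        t.start()
    spawn(n)
    parent_resumed.wait()                 # wait: parent blocks until resumed
    for _ in pes:
        work.put(None)
    for t in pes:
        t.join()
```

Note that this `spawn` performs n insertions, which is exactly the linear cost that a production implementation would need to avoid.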

3.4.  Memory  Management 

A  major  issue  in  the  implementation  of  block  structured  parallel  languages  is 
that  of  stack  management.  Since  the  iterates  or  statements  of  a  parallel  loop  or 
block  are  executed  concurrently,  a  single  run-time  stack  is  insufficient.  Instead,  a 
"cactus  stack"  organization  (Hauck  and  Dent  [68])  is  appropriate.  A  different  stack 
frame  is  allocated  to  hold  local  data  and  control  information  for  each  spawned  task, 
and  these  frames  are  all  linked  to  the  parent's  stack  frame.  There  is  no  serial 
overhead  in  creating  or  destroying  these  private  frames  since,  as  we  shall  see, 
concurrent  memory  requests  for  stack  space  can  be  processed  in  parallel. 
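
A cactus stack can be pictured as frames that each hold a link to the parent's frame. The sketch below is illustrative only: the class `Frame`, the helper `spawn_frames`, and name lookup along the parent chain are assumptions, not details given by the design.

```python
class Frame:
    """One private stack frame; 'parent' links it to the spawning task's frame."""
    def __init__(self, parent=None, **local_vars):
        self.parent = parent
        self.vars = dict(local_vars)

    def lookup(self, name):
        """Search this frame, then each enclosing frame in turn."""
        frame = self
        while frame is not None:
            if name in frame.vars:
                return frame.vars[name]
            frame = frame.parent
        raise NameError(name)

def spawn_frames(parent, n):
    """Give each of n spawned tasks its own frame, all linked to the same parent."""
    return [Frame(parent, i=i) for i in range(n)]
```

Each child sees its own loop index `i` while sharing the parent's variables, so the branches of the stack never interfere with one another.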

3.5.  Performance  Issues  in  Parallel  Control 

There  are  two  significant  performance  criteria  by  which  any  implementation  of 
these  operating  system  primitives  must  be  evaluated.  First,  we  wish  to  avoid 
algorithms  that  require  time  linear  (or  worse)  with  respect  to  the  number  of 
spawned  tasks.  Effective  exploitation  of  high  degrees  of  parallelism  will  quickly 
vanish  if  the  scheduling  of  parallel  tasks  requires  a  serial  operation  that  is  a  linear 
function  of  the  degree  of  parallelism.  Hence  it  is  unacceptable  to  implement  a 
spawn  of  n  processes  by  a  sequence  of  n  insert  operations  on  a  system  task  queue. 
Instead, we would like to have a spawn operation insert a single item with "weight"
n on the task queue, in constant time. Such an implementation is discussed in
section 4.1. A constant-time spawn, a fully parallel (critical-section-free) gettask,
and fully parallel memory allocation routines will avoid the phenomenon of
diminishing marginal benefits when the degree of parallelism is increased.

However, the overhead (in absolute terms) of these operations is also crucial,
because  it  determines  the  granularity  of  parallel  tasks  for  which  the  scheduling 
mechanisms  presented  so  far  in  this  section  are  useful.  If  we  estimate  the  total 
number  of  instructions  executed  by  spawn,  gettask,  and  terminate,  including  not 
only  the  insert  and  delete  operations  on  the  system  task  queue  but  also  the  process 
instantiation functions of memory allocation (for private variables and stack frame),
loading of memory-management registers, etc., then it becomes clear that
dynamically  establishing  the  parallel  context  for  a  forall  or  parbegin  block  is  of 
nontrivial  expense.  Although  the  basic  unit  of  parallelism  provided  by  the  language 
constructs  is  the  program  statement,  it  is  clear  that  spawning  tasks  that  each  execute 
a  trivial  statement  (e.g.,  one  assignment  or  arithmetic  operation)  would  result  in 
great  inefficiencies.  While  measures  can  be  taken  to  reduce  this  overhead,  for 
example  by  careful  design  of  memory  allocation  and  management  subsystems,  the 
facilities  described  to  this  point  must  be  considered  useful  only  for  large-granularity 
parallel  tasks.  This  limitation  can  be  alleviated  in  several  ways. 

Parallel loop iterates of smaller granularity can often be supported with a policy
known as chunking. When the number of parallel iterates (n) is much larger than
the number of available PEs (N), or the number of instructions in the body of the
loop is small compared to the overhead of task creation, then the loop body can be
transformed into a nested loop consisting of k serial iterations, where k < n/N, and
the size of the spawn changed to n/k. Thus the scheduling overhead per iterate is
reduced by a factor of k invisibly to the programmer, without diminishing the
effective  parallelism.  The  value  of  k  is  most  appropriately  determined  at  runtime  as 
a  function  of  the  number  of  PEs  available  and  possibly  of  system  load.  Care  must 
be  taken  to  avoid  setting  k  too  high  and  nullifying  the  load-balancing  properties  of 
the  scheduling  mechanism  (see  section  4.1).  Kruskal  and  Weiss  [84]  have  examined 
the  performance  of  such  chunking  policies  and  argue  that  even  very  naive  schemes 
can  perform  acceptably  well.  A  corresponding  policy  for  inhomogeneous  parallel 
blocks  is  to  detect  at  compile  time  cases  where  the  granularity  of  parallelism  is  too 
small  for  efficient  execution,  and  simply  replace  the  short  parbegin  blocks  by  serial 
blocks. 
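
The chunking transformation can be sketched as below. Both function names are hypothetical, and the rule in `choose_k` (aim for a few tasks per PE) is only one naive heuristic assumed for illustration; it is not a policy taken from Kruskal and Weiss [84].

```python
def chunked_tasks(n, k):
    """Split iterates 0..n-1 into tasks of at most k serial iterations each."""
    return [range(start, min(start + k, n)) for start in range(0, n, k)]

def choose_k(n, num_pes, tasks_per_pe=4):
    """Pick k so that several tasks remain per PE, preserving load balancing."""
    return max(1, n // (num_pes * tasks_per_pe))

def run_chunk(body, chunk):
    """The nested serial loop executed by one spawned task."""
    for i in chunk:
        body(i)
```

With n = 1000 iterates and N = 8 PEs this heuristic gives k = 31, well below n/N = 125, so the spawn shrinks from 1000 tasks to 33 while each PE can still expect about four tasks.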

Effective  granularity  of  programmed  parallel  tasks  can  be  further  reduced  by 
pre-spawning  processes.  Here  either  the  programmer  or  compiler  causes  a 
sufficient number of tasks to be spawned in advance of any parallel constructs and
immediately suspended. The forall or parbegin code then activates these tasks by
sending a message or a signal, thus "creating" parallel threads of control without
the overhead of operating system calls. Pre-spawning is less effective when the
degree  of  parallelism  is  volatile. 

Finally, we can pre-spawn non-preemptable tasks that busy-wait until needed
for  a  parallel  segment  of  code.  Far  finer  granularities  of  parallel  tasks  can  thus  be 
programmed  with  high-level  constructs,  since  virtually  all  of  the  overhead  of 
process  creation,  scheduling,  and  context  switching  is  eliminated.  It  remains  to  be 
seen  whether  such  a  policy  will  constitute  an  avenue  toward  optimally  efficient 
execution. 

3.6.   Performance  Issues  in  Barrier  Synchronization 

The  sync  operation  may  be  implemented  in  either  of  two  ways,  analogously  to 
the  situation  of  process  spawning.  It  is  always  correct  for  the  processor  to  suspend 
the  current  task  while  waiting  for  the  synchronization  to  complete,  and  switch  to 
another  task.  However,  considerable  operating  system  overhead  is  incurred.  The 
alternative is a busy-waiting synchronization routine which simply loops testing a
counter until the synchronization condition is satisfied. The overhead involved is
significantly less for busy-waiting than for task switching, provided the waiting
period is short. However, busy-waiting admits the possibility of deadlock when the
number of synchronizing processes is greater than the number of available
processors (and no preemption is permitted). Even when deadlock cannot occur it
is possible that a number of processors will loop unproductively for extended
periods  of  time,  especially  in  a  multiprogramming  environment. 

A  hybrid  sync  implementation,  in  which  each  process  busy-waits  for  a  short 
period  and  then,  if  necessary,  yields  the  PE,  might  be  used  to  advantage. 
Rescheduling  overhead  would  be  avoided  in  those  cases  where  busy-waiting  is 
suitable,  and  deadlock  would  be  prevented.  Synchronization  has  implications  for 
process  and  job  scheduling  that  are  addressed  in  the  following  section. 
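
One way to picture the hybrid scheme is a barrier that spins for a bounded number of tests and then blocks. The class `HybridBarrier`, its `spin_limit` parameter, and the single-use restriction are assumptions of this sketch; the lock merely emulates an atomic increment of the arrival counter.

```python
import threading

class HybridBarrier:
    """Single-use barrier: busy-wait briefly, then give up the processor."""
    def __init__(self, parties, spin_limit=1000):
        self._parties = parties
        self._spin_limit = spin_limit
        self._arrived = 0
        self._lock = threading.Lock()         # emulates an atomic fetch-and-add
        self._all_here = threading.Event()

    def sync(self):
        with self._lock:
            self._arrived += 1
            last = self._arrived == self._parties
        if last:
            self._all_here.set()              # release everyone
            return
        for _ in range(self._spin_limit):     # phase 1: busy-wait
            if self._all_here.is_set():
                return
        self._all_here.wait()                 # phase 2: block and yield the PE

def demo(parties=8):
    order = []
    barrier = HybridBarrier(parties, spin_limit=10)

    def task(i):
        order.append(("pre", i))
        barrier.sync()
        order.append(("post", i))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(parties)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

Because a waiting task eventually blocks rather than spinning forever, the barrier completes even when there are more synchronizing tasks than processors.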


4.   Process Scheduling

4.1.  The  "Self-Service"  Paradigm 

Operating  systems  for  uniprocessors  and  some  multiprocessors  often  contain  an 
active  component  which  schedules  use  of  the  processor(s)  by  assigning  tasks  as 
appropriate.  While  this  centralized  approach  assures  favorable  load  balancing,  it 
introduces  a  serial  bottleneck  which  will  limit  overall  performance  on  highly  parallel 
machines.  We  may  alleviate  this  by  designating  two  or  more  schedulers,  each 
managing  a  portion  of  the  processors.  This  distributed  approach  leads  to  an 
interesting  tradeoff:  if  the  number  of  schedulers  is  small,  scheduling  bottlenecks 
arise;  if  the  number  is  large,  effective  load  balancing  suffers. 

Stochastic  distributed  scheduling  offers  one  solution.  Here  a  work  queue  is 
maintained  for  each  PE.  Whenever  a  task  is  created  (e.g.,  via  spawn)  on  any  PE, 
that  PE  assigns  the  task(s)  on  a  random  basis  to  one  of  the  work  queues.  In  concert 
with  time-slicing,  or  when  certain  constraints  (on  the  variance  of  task  lengths)  hold, 
this  mechanism  can  effectively  balance  the  load,  possibly  at  a  cost  in  scheduling 
overhead (see Klappholz [82]). It does not permit the constant-time spawning of
multiple processes discussed earlier since the assignment of each task must be
randomized  individually. 
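
The cost noted above is visible in even the simplest sketch of stochastic assignment. The function name `stochastic_spawn` and the use of one `deque` per PE are assumptions for illustration.

```python
import random
from collections import deque

def stochastic_spawn(queues, tasks, rng=random):
    """Assign each new task to a randomly chosen per-PE work queue."""
    for task in tasks:
        # One random draw per task: a spawn of n tasks costs n insertions.
        rng.choice(queues).append(task)

def queue_lengths(queues):
    return [len(q) for q in queues]
```

A spawn of n tasks therefore takes time proportional to n on the spawning PE, in contrast with the single weighted insertion that a central queue permits.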

The  scheduling  paradigm  adopted  for  process  scheduling  (and  other  resource 
management  functions)  in  the  Ultracomputer  is  that  of  a  self-service  system.  We 
maintain  a  single  central  queue  of  ready  tasks.  Each  processor  accesses  this  shared 
queue  to  obtain  tasks  for  execution  and  to  insert  newly-spawned  tasks.  This  self- 
service  paradigm  relies  on  simultaneous  distributed  processing  of  centralized  data, 
and  is  highly  dependent  on  concurrently-accessible  data  structures  that  allow 
operations  to  be  performed  without  locking  or  other  serializing  techniques.  The 
fetch-and-add  based  mechanisms  described  earlier  are  crucial  in  implementing  these 
critical-section-free  operations,  e.g.,  queue  insertion  and  deletion. 

The  queue  algorithms  used  are  enhanced  variants  of  the  algorithms  mentioned 
earlier (a circular array with insert and delete pointers on which fetch-and-add is
used  to  obtain  unique  entries).  In  addition  to  providing  greater  flexibility  in  storage 
management  for  queue  entries  and  greater  protection  against  certain  boundary 
conditions,  these  variants  support  three  important  additional  features: 

(1) Multiplicity. A spawn of k processes is implemented by the insertion of a task
template with multiplicity k. The queued item will be "deleted" k times before
actually being removed from the queue.

(2) Priorities. Tasks may be queued on the central ready queue at any one of a
(possibly fixed) number of priorities. Delete operations select the available
item with highest priority.^9

(3)  Interior  removal.  In  order  to  swap  out  a  task  from  memory,  or  to  prematurely 
terminate  a  task,  it  is  occasionally  necessary  to  "delete"  an  item  from  the 
middle  of  a  queue  (Wilson  [85]). 
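
The multiplicity feature can be modelled as follows. This sketch assumes a lock-protected `deque` purely for brevity, whereas the algorithms described here are critical-section-free; the class name `MultiplicityQueue` and the returned instance index are likewise hypothetical. Priorities and interior removal are not modelled.

```python
import threading
from collections import deque

class MultiplicityQueue:
    """Queue whose entries are task templates that can be deleted k times."""
    def __init__(self):
        self._items = deque()
        self._lock = threading.Lock()     # emulation only, not the real algorithm

    def insert(self, template, multiplicity=1):
        """A spawn of k tasks is a single insertion, whatever the value of k (k >= 1)."""
        with self._lock:
            self._items.append([template, multiplicity, 0])

    def delete(self):
        """Return (template, instance_index), or None if the queue is empty."""
        with self._lock:
            if not self._items:
                return None
            entry = self._items[0]
            template, k, taken = entry
            entry[2] = taken + 1
            if entry[2] == k:             # k-th deletion really removes the item
                self._items.popleft()
            return template, taken
```

For example, after `insert("loop_body", 3)` the next three `delete` calls return instances 0, 1, and 2 of the same template, and only then does the entry leave the queue.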

Distributed  scheduling  from  a  central  ready  queue  achieves  optimal  load 
balancing  among  the  PEs  and  also  facilitates  multiprogramming.  Each  of  several 
unrelated jobs may contribute tasks to the ready queue; all will be scheduled as PEs
become  available.  When  programs  with  volatile  parallelism  are  run, 
multiprogramming  can  improve  throughput  by  allowing  serial  sections  and  highly 
parallel sections of different jobs to be overlapped. If desired, time-slicing can be
used  to  improve  the  average  turnaround  time  of  multiprogrammed  parallel  jobs. 


^9 Among other more obvious functions, task priorities are needed in management of nested
spawns.  The  dynamic  process  structure  of  a  program  in  which  spawns  are  nested  several  levels  deep 
may  be  depicted  as  a  tree,  in  which  spawned  child  processes  are  represented  by  nodes  that  are 
descendants  of  their  parent  process  node.  If  the  program  is  executed  such  that  the  process  tree  is 
traversed  breadth-first,  then  there  is  a  danger  that  because  the  processes  at  each  level  are  spawning 
more  processes  before  any  process  may  terminate,  the  capacity  of  memory  or  system  tables  may  be 
overwhelmed by an exponential explosion in instantiated processes. This is avoided by ensuring
depth-first  traversal  of  the  process  tree,  which  may  be  achieved  by  enqueuing  the  template  for 
creation  of  spawned  children  on  the  ready  queue  at  a  priority  greater  than  that  of  the  parent. 

4.2. Job Scheduling

This  simple  model  of  a  central  pool  of  individual  tasks  is  not  adequate  to 
effectively schedule all types of parallel applications. Many parallel program
designers will need the capability to manage a set of PEs at a lower level than that
which results from use of implicit parallel control constructs. Such programs will be
generally  static  in  their  degree  of  parallelism,  and  will  have  little  need  of  operating 
system  scheduling  facilities  after  an  initial  phase.  In  fact,  since  such  a  program  can 
implement  its  own  internal  scheduling,  synchronization,  and  memory  management, 
it  will  be  desirable  to  eliminate  the  process  instantiation  and  context  switching 
overhead  associated  with  operating  system  scheduling  services. 

We  propose  to  add  a  variant  of  the  spawn  system  primitive  to  support  non- 
preemptable,  related  groups  of  parallel  tasks.  When  granting  a  non-preemptable 
spawn  request,  the  operating  system  first  ensures  that  the  total  number  of  non- 
preemptable  tasks  will  not  exceed  the  number  of  processors,  then  arranges  that  the 
spawned tasks be assigned PEs. Thus long-running tasks stemming from a
particular  spawn  may  be  immune  from  time-slicing  and  other  forms  of  preemption. 
In  order  to  see  why  this  is  needed  we  consider  a  spectrum  of  "organizational  styles" 
of  parallel  programs. 

(1)  On  the  extreme  (static)  end  of  the  scale  are  jobs  that  are  static  in  their  use  of 
PEs and are subject to real-time constraints. Tasks associated with such jobs
would  not  be  preempted  under  any  circumstances. 

(2)  More  important  is  the  class  of  non-real-time  applications  referred  to  above, 
which  are  designed  to  coordinate  parallel  activities  on  a  fixed  group  of  PEs. 
Many  such  programs  may  be  characterized  by  very  frequent  internal 
synchronization.  Busy-waiting  synchronization  is  appropriate  because  the 
overhead  of  even  occasional  process  blocking  and  rescheduling  might  be 
prohibitive;  furthermore,  if  all  of  the  constituent  processes  are  guaranteed  to  be 
executing  concurrently  then  busy-waiting  is  efficient.  Such  a  job  may  be 
swapped out, if desired because of turnaround considerations, as long as all of
the  processes  are  swapped  in  and  out  together. 

(3)  Jobs  displaying  dynamic  parallelism  are  to  some  extent  better  suited  to  our 
original  model  of  task  scheduling.  Although  internal  synchronization  will  still 
be  needed  occasionally,  it  can  be  adequately  managed  with  the  hybrid 
synchronization  mechanism  proposed  above,  even  in  the  presence  of  chunking. 
However  there  remain  several  reasons  to  arrange  scheduling  mechanisms  that 
tend  to  keep  related  ("sibling")  processes  executing  concurrently.  First,  there 
may  exist  algorithms  whose  performance  is  improved  when  parallel  processes 
execute  more  or  less  evenly,  that  is,  when  scheduling  policies  tend  to  maximize 
the  execution  rate  of  the  slowest-progressing  task  in  the  spawned  set.  Second, 
the  effectiveness  of  the  processor  cache  will  be  improved  when  successive  tasks 
executed  on  the  same  PE  come  from  the  same  job,  since  they  are  then  likely  to 
reuse cache entries for program code.^10

The  structure  of  the  central  ready  queue  is  further  complicated  by  the  need  to 
recognize  groups  of  sibling  tasks  in  swapping  and  scheduling.  However,  the 
original  fetch-and-add  based  notion  of  bottleneck-free  inserts  and  deletes  is 
maintained. 


5.   Building  the  Parallel  Operating  System 

Since  programs  written  for  an  Ultracomputer-like  machine  are  anticipated  to 
generate  hundreds  or  thousands  of  concurrent  activities,  we  must  be  prepared  to 
accommodate  a  correspondingly  high  level  of  simultaneous  requests  for  operating 
system  services.  Serial  processing  of  these  requests  will  generate  unacceptable 
bottlenecks  on  a  large  machine.  Therefore,  the  operating  system  must  itself  be  a 
parallel  program,  decentralized  and  free  of  serial  code  sections  that  would  cause 
bottlenecks. 

We  have  already  outlined  the  implementation  of  processor  scheduling  for  these 
constraints.  In  most  respects  the  parallel  operating  system  faces  the  same 
coordination  problems  as  any  parallel  program  that  is  "low  level"  in  the  sense 
defined  earlier.  The  synchronization  primitives  and  data  structures  described  below 
in  the  context  of  the  operating  system  are  equally  applicable,  with  minor 
implementation  differences,  in  user  applications. 

The  algorithms  and  mechanisms  described  have  been  implemented  in  an 
experimental  operating  system  which  is  running  on  a  prototype  (8-processor) 
Ultracomputer. Based on UNIX Version 7, the experimental system is symmetric
(i.e.,  there  is  no  master-slave  relationship)  as  well  as  parallel.  It  incorporates 
rudimentary  but  usable  facilities  for  parallel  applications  programs. 

5.1.   Data  Structures 

All  operating  system  functions  rely  heavily  on  the  use  of  shared  data  structures 
that  can  be  efficiently  accessed  in  parallel.  In  this  section  we  consider  a  few 
particular  structures  that  have  proven  useful.  Once  again,  the  fact  that  multiple 
simultaneous  references  to  the  same  memory  location  can  be  accomplished  in  the 
time  required  for  one  reference  allows  us  to  avoid  serial  bottlenecks. 

Note  that  the  linked  list,  a  popular  data  structure  of  wide  application  in  (serial) 
resource  and  data  management  code,  is  not  suitable  where  concurrent  manipulation 
is  required.  We  know  of  no  algorithm  for  deleting  items  in  a  linked  list  without 
locking  out  some  other  accesses,  or  for  assigning  individual  list  items  to  different 
PEs  without  serialization.  The  desirable  characteristics  of  linked  lists  must  be 
found  in  other  structures. 


^10 Here we assume that the cache architecture permits retaining of cache lines across a context
switch.

5.1.1. Queues We have seen the central role played by the fetch-and-add based
parallel  queue  in  process  scheduling.  Similar  queues  are  used  by  synchronization 
primitives:  the  set  of  processes  waiting  for  a  lock  to  be  released,  or  for  an  event, 
are  held  on  a  queue  (see  section  5.3.2).  Parallel  queues  are  useful  in  managing  a 
parallel-accessed  pool  of  items  even  where  there  is  no  ordering  required. 

Parallel  queues  used  in  the  prototype  operating  system  are  of  somewhat 
different  structure  from  the  simple  fixed-size  circular  array  discussed  earlier. 
Queues of unbounded size are implemented by associating a linked list, protected
by  a  semaphore,  with  each  element  of  the  fixed-size  array.  An  insertion  at  array 
element  j  appends  its  item  to  the  list  associated  with  element  j.  The  maximum 
concurrent  access  supported  by  this  structure  equals  the  number  of  lists,  i.e.,  the 
size  of  the  fixed  array  (Rudolph  [82]). 
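
The structure just described might be sketched as follows. Several details are assumptions rather than the prototype's algorithm: `FetchAndAdd` emulates the hardware primitive with a lock, a counting semaphore tracks completed insertions so that `delete` can detect an empty queue, and a deleter that reaches its slot before the matching insertion has finished simply retries. Strict FIFO order is guaranteed only without concurrent inserts.

```python
import threading
from collections import deque

class FetchAndAdd:
    """Emulated fetch-and-add: atomically adds delta and returns the OLD value."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()     # emulation only

    def add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old

class ParallelQueue:
    """Fixed array of lists; each list is protected by its own semaphore."""
    def __init__(self, size):
        self._lists = [deque() for _ in range(size)]
        self._sems = [threading.Semaphore(1) for _ in range(size)]
        self._insert_ptr = FetchAndAdd(0)
        self._delete_ptr = FetchAndAdd(0)
        self._count = threading.Semaphore(0)   # completed, unclaimed insertions

    def insert(self, item):
        j = self._insert_ptr.add(1) % len(self._lists)   # claim a unique slot
        with self._sems[j]:
            self._lists[j].append(item)
        self._count.release()

    def delete(self):
        if not self._count.acquire(blocking=False):
            return None                                   # queue is empty
        j = self._delete_ptr.add(1) % len(self._lists)
        while True:
            with self._sems[j]:
                if self._lists[j]:
                    return self._lists[j].popleft()
            # The matching insert has claimed slot j but not yet appended; retry.
```

Operations directed at different array elements never contend for the same semaphore, which is why the available concurrency equals the size of the array.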

5.1.2.  Hash  tables  A  similar  structure  results  from  the  use  of  hash  tables  to 
access  indexed  (dictionary)  information.  The  size  of  the  hash  table  (number  of 
buckets)  is  set  according  to  the  desired  maximum  concurrency,  usually  the  number 
of  PEs  times  a  small  factor.  Buckets  are  linked  lists,  protected  by  readers-writers 
locks,  so  that  in  a  high  percentage  of  cases  items  are  accessed  with  no  serialization. 
See  section  5.3.1  for  an  example  of  this  structure. 

5.2.  Memory  Allocation 

The  creation  of  a  new  task  requires  allocation  of  space  for  its  private  variables 
as well as its cactus stack frame. As with task management, we adopt a self-service
mechanism. The processor that retrieves a task from the ready queue executes the
memory allocation code. A number of parallel algorithms for memory management
have been designed, including (non-demand) paging, two variants of the Buddy
System  (Knuth  [68]),  and  a  boundary  tag  method  (Knuth  [68]).  All  are  parallel 
analogs  of  serial  algorithms.  All,  except  for  one  of  the  Buddy  System  variants, 
maintain  queues  (whereas  their  serial  analogs  keep  lists)  of  free  memory  blocks. 
Insertions,  deletions,  and  accesses  to  blocks  within  a  concurrently-accessible  queue 
are  at  the  core  of  these  algorithms;  as  a  consequence  we  obtain  critical-section-free 
memory  allocation. 

5.3. Coordination Primitives

In  order  to  permit  tasks  to  cooperate,  it  is  often  necessary  to  coordinate  their 
accesses  to  shared  data  structures.  An  ideal  situation  for  parallel  execution  occurs 
when  completely  asynchronous  behavior  is  permitted,  as  in  some  "chaotic" 
algorithms.  When  restricting  access  patterns,  one  must  be  careful  to  permit  as 
much  parallelism  as  possible.  In  this  section  we  discuss  three  mechanisms  for  task 
coordination  that  have  been  successfully  used  in  our  prototype  multiprocessor 
system,  namely  counting  semaphores,  readers/writers  locks,  and  events.  When 
designing  such  access  mechanisms,  one  must  specify  whether  a  processor  denied 
access  should  (busy)  wait,  or  suspend  execution  of  the  current  task  and  switch  to 
another. 

5.3.1. Busy-Waiting Synchronization Despite the potential for waste in
busy-waiting, there are several reasons for using it, including potentially low
overhead and applicability to situations where context switching is inappropriate^11.
There  are  many  possible  busy-waiting  mechanisms;  we  have  used  counting 
semaphores  and  readers/writers  synchronization  extensively,  and  have  described 
algorithms  for  them  before  (Gottlieb,  Lubachevsky,  and  Rudolph  [83]). 

Semaphores  are  often  used  to  implement  simple  locks  to  serialize  access  to  a 
small  partition  of  a  larger  concurrently  accessible  data  structure;  for  example,  the 
linked lists used in the queues of unrestricted size of section 5.1.1 are protected with
semaphores. 

Readers/writers  synchronization  is  used  naturally  in  cases  where  exclusive 
access  is  required  only  infrequently.  Our  implementation  also  supports  upgrading  a 
read  lock  to  a  write  lock  and  downgrading  a  write  lock  to  a  read  lock.  This  has 
been  used  to  implement  search  structures  which  must  support  the  operation:  "search 
for an item, and insert it if not found". Such a structure uses linked lists accessed
through a hash table, as described in section 5.1.2. By performing the search with a
read  lock,  serialization  is  avoided  in  many  cases  (the  most  important  case  is  where 
many  processors  search  for  the  very  same  item).  Only  if  the  item  is  not  found  is  it 
necessary  to  upgrade  the  lock.  The  upgrade  operation  can  succeed  or  fail;  the 
processors  that  fail  go  back  to  search  again,  while  the  (one)  processor  that  succeeds 
performs  the  insert.  These  search  structures  have  been  widely  applicable  in  our 
prototype  operating  system,  e.g.  in  managing  file  system  information. 
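
The "search, and insert if not found" operation might look as follows. This is a sketch under assumptions: `UpgradableRWLock` is a hypothetical lock built on a condition variable (it blocks rather than busy-waits and offers no plain write-acquire), a failed upgrade is taken to release the read lock, and a Python `dict` stands in for each bucket's linked list.

```python
import threading

class UpgradableRWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False       # True while a writer holds the lock
        self._upgrading = False    # True while one reader waits to become the writer

    def acquire_read(self):
        with self._cond:
            while self._writer or self._upgrading:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def try_upgrade(self):
        """Caller holds a read lock. True: it now holds the write lock.
        False: another reader won the upgrade and the caller holds nothing."""
        with self._cond:
            self._readers -= 1
            if self._upgrading:
                self._cond.notify_all()
                return False
            self._upgrading = True
            while self._readers > 0:          # wait for remaining readers to leave
                self._cond.wait()
            self._upgrading = False
            self._writer = True
            return True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

class SearchTable:
    def __init__(self, num_buckets=16):
        self._buckets = [({}, UpgradableRWLock()) for _ in range(num_buckets)]

    def search_or_insert(self, key, make_value):
        items, lock = self._buckets[hash(key) % len(self._buckets)]
        while True:
            lock.acquire_read()
            if key in items:                  # common case: no serialization
                value = items[key]
                lock.release_read()
                return value
            if lock.try_upgrade():            # exactly one processor succeeds
                try:
                    items[key] = make_value()
                    return items[key]
                finally:
                    lock.release_write()
            # Upgrade failed: go back and search again.
```

When many tasks look up the same missing key at once, one of them performs the insert and the rest find the item on their second search.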

5.3.2.  Non-Busy-Waiting  Synchronization  Non-busy-waiting 
synchronization  is  commonly  used  in  multiprogramming  systems  since  it  allows  a 
processor  to  go  on  to  more  useful  work  when  the  progress  of  the  current  task  is 
logically  blocked.  Rather  than  waiting  for  progress  to  be  possible  again,  the 
processor  places  the  current  task  on  a  queue  associated  with  the  condition  to  be 
satisfied,  and  executes  the  next  task  from  the  ready  queue.  When  the  condition  is 
eventually  satisfied,  the  blocked  task  is  moved  from  the  waiting  queue  to  the  ready 
queue.  Although  this  organization  is  fundamental  to  all  multiprogramming 
systems,  the  exact  forms  of  the  synchronization  primitives  vary. 

The  best  examples  of  the  use  of  non-busy-waiting  synchronization  come  from 
the  area  of  I/O  processing.  This  takes  on  special  significance  for  the 
Ultracomputer,  because  the  number  of  tasks  performing  related  I/O  operations  can 
be  large.  For  example,  searching  of  important  file  system  directories  would  be  a 
bottleneck  if  serialized.  It  is  very  likely  that  a  group  of  tasks  reading  such  a 
directory  (or  other  frequently-accessed  file)  would  all  simultaneously  attempt  to 
read  the  same  disk  block;  serialization  would  be  devastating. 

At a low level, physical I/O devices require serialization; this is easily provided
by semaphores. Apparent parallelism for disks and similar devices can be achieved
through the use of in-memory buffer caching. Once a disk block is copied into a
memory buffer, there is no obstacle to concurrent access for reading;
readers/writers synchronization is appropriate in this situation.


^11 E.g., in the implementation of non-busy-waiting mechanisms.

Events  are  often  associated  with  external  occurrences,  such  as  the  completion 
of  an  I/O  operation  or  the  arrival  of  input  from  a  user's  terminal.  At  such  a  time, 
the event is signaled, remaining in this state until reset. Additional signals (before
the  reset)  have  no  effect. 

There  is  a  special  difficulty  in  designing  efficient  algorithms  for  readers/writers 
and events, because the most natural implementation would require a single task to
completely empty a queue, introducing a serial bottleneck. When a writer releases
its lock, it must awaken all waiting readers,^12 and when an event is signaled, all
tasks  waiting  for  it  must  be  awakened.  We  have  developed  two  approaches  to  this 
problem.  One  approach  is  to  move  the  wait  queue  as  a  single  object  onto  the 
system  ready  queue,  where  it  will  be  treated  in  much  the  same  way  as  an  item  with 
multiplicity.  Another  approach  is  to  let  each  newly  awakened  task  "help  out"  by 
waking  up  other  tasks  until  the  wait  queue  is  empty.  This  approach  is  less  complex 
than the "queue of queues" method, but requires more time. A variation on this
method  is  to  spawn  a  set  of  high  priority  tasks  (with  multiplicity  proportional  to  the 
size  of  the  wait  queue)  to  assist  with  the  wake  up. 
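
The "help out" approach can be sketched as below. The class `HelpOutEvent`, the fan-out of two wake-ups per awakened task, and the lock-protected wait queue are assumptions made for brevity; reset is omitted.

```python
import threading
from collections import deque

class HelpOutEvent:
    """Event whose waiters share the work of waking one another."""
    def __init__(self):
        self._lock = threading.Lock()     # emulation only; protects the wait queue
        self._waiters = deque()
        self._signaled = False

    def wait(self):
        me = threading.Event()
        with self._lock:
            if self._signaled:            # already signaled: no need to block
                return
            self._waiters.append(me)
        me.wait()                         # blocked until some task wakes us
        self._wake_some(2)                # help out: wake up to two more waiters

    def signal(self):
        with self._lock:
            self._signaled = True         # further signals have no extra effect
        self._wake_some(1)                # the signaler wakes only one task

    def _wake_some(self, count):
        for _ in range(count):
            with self._lock:
                if not self._waiters:
                    return
                waiter = self._waiters.popleft()
            waiter.set()
```

The signaling task does a constant amount of work, and the number of awake helpers can double at each step, so a wait queue of m tasks is emptied in roughly log m rounds rather than m serial steps.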

As  mentioned  above,  many  events  are  associated  with  physical  I/O.  Some  I/O 
events  might  never  happen,  e.g.  input  from  a  user's  terminal.  Thus  it  must  be 
possible to terminate a task that is blocked on such an event^13. We have developed
a  method  of  premature  unblocking  for  all  of  the  non-busy-waiting  synchronization 
primitives  described  above.  This  requires  removing  the  task  from  the  middle  of  the 
wait  queue  (Wilson  [85]).  In  each  case,  care  must  be  taken  to  insure  that  the  state 
of  the  synchronization  variable  (e.g.  semaphore  value)  remains  consistent. 


6.   Conclusion 

Previous  shared-memory  multiprocessors  have  been  limited  to  a  low  degree  of 
parallelism.  The  construction  of  highly  parallel  machines  requires  that  no  serial 
bottlenecks  are  introduced  by  either  hardware  or  software.  The  fetch-and-add 
synchronization  primitive  provides  a  simple  and  powerful  means  for  achieving  this 
goal  by  permitting  programmers  to  employ  shared  data  structures  without  relying 
on  critical  sections.  We  believe  that  parallel  operating  systems  can  be  built  that 
perform all functions of resource management, scheduling, and coordination using
critical-section-free  algorithms.  The  prototype  NYU  Ultracomputer  operating 
system  has  been  designed  to  demonstrate  the  feasibility  of  this  approach. 


^12 It is usually desirable for writers to have priority, so the readers are only to be awakened if
there  are  no  more  waiting  writers. 

^13 Other, more sophisticated, actions are also possible, e.g. UNIX signal handling.

Effective use of a highly parallel computer is facilitated by a simple
programming  model  that  permits  writing  of  programs  using  variants  of  conventional 
high-level  procedural  languages.  Further  experience  is  needed  to  determine 
whether  the  operating  system  primitives  that  have  been  designed  to  date  will 
effectively  support  parallel  constructs  for  such  languages.  In  particular  it  is  crucial 
that  these  high-level  mechanisms  not  introduce  inefficiencies  that  would  prevent 
their  widespread  utility.  A  variety  of  approaches  to  parallel  control  and  scheduling 
is being studied for the support of a wide range of parallel applications.